A Study on Mutual Information-based Feature Selection for Text Categorization
Authors
Abstract
Feature selection plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in TC. Many existing experiments show that IG is one of the most effective methods; by contrast, MI has been reported to perform relatively poorly. Under one existing MI method, the mutual information of a category c and a term t can be negative, which conflicts with the definition of MI in information theory, where it is always non-negative. We show that the form of MI used in TC is not derived correctly from information theory. Two different MI-based feature selection criteria are both referred to as MI in the TC literature; one of them should correctly be termed "pointwise mutual information" (PMI). In this paper, we clarify the terminological confusion surrounding the notion of "mutual information" in TC, and detail an MI method derived correctly from information theory. Experiments on the Reuters-21578 and OHSUMED collections show that the corrected MI method's performance is similar to that of IG, and considerably better than that of PMI.
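The distinction the abstract draws can be made concrete. A minimal sketch, with an illustrative (hypothetical) 2x2 term-category contingency table not taken from the paper's experiments: PMI scores a single event (term present, document in category) and can be negative when a term co-occurs with a category less often than independence would predict, whereas MI is the expectation of PMI over all four cells and is always non-negative.

```python
import math

# Hypothetical counts for one term t and one category c
# (illustrative only, not from the paper):
#                 in category c   not in c
# term present         49             141
# term absent          27             652

def pmi(n_tc, n_t, n_c, n):
    """Pointwise mutual information of the single event (t, c):
    log2( P(t,c) / (P(t) * P(c)) ). Can be negative."""
    return math.log2((n_tc / n) / ((n_t / n) * (n_c / n)))

def mi(table):
    """Expected mutual information I(T; C), summing P(t,c) * PMI(t,c)
    over all four cells of the contingency table. Always >= 0."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue  # convention: 0 * log 0 = 0
            total += (n_ij / n) * math.log2(n_ij * n / (row_sums[i] * col_sums[j]))
    return total

table = [[49, 141], [27, 652]]
n = sum(sum(row) for row in table)
print(pmi(49, 49 + 141, 49 + 27, n))  # single-event score; may be negative
print(mi(table))                      # non-negative by definition
```

Ranking terms by `mi` rather than by `pmi` of the positive cell is, in spirit, the correction the paper argues for: the expected form weights each cell by its probability instead of scoring one rare event in isolation.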
Similar Resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...
Feature Selection for Text Classification Based on Gini Coefficient of Inequality
A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method, known as within-class popularity, to deal with feature selection based on the concept of the Gini coefficient of inequality (a commonly used measure of income inequality). The proposed...
Support Vector Machines based Arabic Language Text Classification System: Feature Selection Comparative Study
Feature selection is essential for effective and accurate text classification systems. This paper investigates the effectiveness of six commonly used feature selection methods. Evaluation used an in-house collected Arabic text classification corpus, and classification is based on a Support Vector Machine classifier. The experimental results are presented in terms of precision, recall and Macroave...
A Knowledge-Based Feature Selection Method for Text Categorization
A major difficulty of text categorization is the high dimensionality of the original feature space. Feature selection plays an important role in text categorization. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), mutual information (MI), and so on are commonly applied in text categorization. Many existing experiments show IG is one of th...
A Comparative Study with Different Feature Selection For Arabic Text Categorization
Feature selection benefits a learner by eliminating non-informative or noisy features and by reducing the overall feature space to a manageable size. The term "feature selection" is used in machine learning for the process of selecting a subset of features used to represent the text. In this paper, we propose a new approach for text representation based on incorporating background knowledge Arabi...